

A Portable Parser Tool

Wilfred J. Hansen
Andrew Consortium
Carnegie Mellon University


Abstract: Since yacc, Unix parser tools have been suffered from
non-portability, rigidity, and potential symbol conflicts.  The work
reported here avoids all the problems and produces either a distinct
module in C, or an entire class definition in C++.  The basic
components are a script to strip the proprietary parts from Bison output 
and a subclassable parser.


In olden days, yacc [Johnson] generated efficient source code, 
appropriately minimizing memory and CPU utilization.  The limitations 
adopted to achieve this efficiency have become increasingly 
annoying with the advant of larger, faster computers and more
integrated software.  Among the limitiations of yacc are these

	Name space conflicts - Because the code of the parser is compiled with
	user code, users must avoid certain names;  the parser uses names 
	beginning with 'yy'.

	Single parser - As a result of the naming conventions, it is not 
	possible--without special arrangements--to have more than one parser 
	in a program.

	Non-portable parsers - Each vendor distributes a slightly different 
	version of the parser, with different sets of global names.  These
	differences make it effectively impossible to build portable
	software based on yacc.

Given these problems, we decided to adopt the Bison parsing technology.
Unfortunately, the bison parser comes with a license that imposes
unacceptable conditions on software that incorporates that parser.
(There are no restrictions on using the Bison parser generator, just
on using the parser that Bison inserts in the code.)  Consequently,
at thesuggestion of Richard Stallman, we produced a postprocessor
for Bison output which solves the various problems.

Here are some of the goals we set out to accomplish:

	Standalone parser - The parser should a separate module 
	to which tables describe the particular grammar.  It should be
	possible to utilize the same parser code for all grammars
	and even to have more than one compilation in progress at
	a time.

	Single symbol - The system generates a module whose compilation
	defines a single symbol by which the grammar can be incorporated
	into the parser.


Separate stack per each compilation
	dynamically allocated and expanding as needed
	variable size stack elements

would like that there be no function call for reduction
	seems to require embedded code - licensing problem
	need some kludge like YYSTYPE
	On PMAX, function call was faster!  (probably an
	artifact, but still indicates that function call is not a problem)


on a PMAX, the .o file for a trivial grammar, with parser, is 4K  (4264).
Presumably some of this could be avoided if the parser code
could somehow be shared.  Offsetting this is the desire to
not have a function call for each reduction.

The tables alone for the trivial grammar are 1K  (1164)
Thus an object approach could save 3K per parser.

??? parse.o for the parse object is 28K.
but when extraneous routines are removed and the object is stripped,
the size is 4K  (3852)  This excludes debugging statements and tables.
By assuming that stack munging and error processing are elsewhere,
The size can be further reduced to 2948.  


on the same pmax, what is the cost of a function call 
10000000 executions of
a) empty loop  1.05 sec
b) f(){}	3.58 sec
c) f(i){foo=i;} 3.65 sec

one exec of (b)  2.53/10000000 = .253 usec
one exec of (c) .260 usec


THe prior time trial appears affected by code caching, so loops with 
more contents were tried.  And the function call approach adds the
requirement to use a memcpy, so that was timed as well.

loop with no contents: 		.078 usec	(first iteration was .234 usec)
function call - no memcpy:  	3.51 usec	(first iteration was 3.59 usec)
function call - memcpy:		5.62 usec	(first iteration was 5.70 usec)
inline code - no memcpy:  	4.08 usec	(first iteration was 4.06 usec)

Curiously, the function call case is FASTER than the inline case.  However,
the memcpy adds enough to make it more than the inline case, by 1.6 usec.

Conclusion: the execution cost of the parse object approach (function 
call per reduction and memcpy per shift) is under 2 usec per token 
more than the inline version.  If a 100,000 byte source text has 20,000
tokens, the extra cost is 40,000 usec, or .04 sec.


stack is in object for compilation
	starts at 200 elements and doubles with realloc
code is in object - make it free

#define PARSEROCKTYPE & LEXROCKTYPE within .gra file

.gra file has %{...%} as in .y files
support %union and typedef ... YYSTYPE
	use YYSTYPE in stack declarations
	if no %union and no %type and no <id>, then output
	#ifndef YYSTYPE; #define YYSTYPE int; #endif
errors are reported to foo_Error() (instead of yyerror)
	which must be defined in %{...%} and
	has parameters: parse-object, msg

In order to do actions as functions or switch elts
		 an action can end one of these ways:
	drop out - parsing continues normally
	YYACCEPT - parsing terminates and succeeds
	YYABORT - parsing terminates and fails
	YYERROR - parse enters error state
	YYERROROK - parse leaves error state and continues
	YYCLEARIN - pending input token is dropped
	YYCLINEROK - pending input token is dropped and
			parser leaves error state
	YYCLINERR - pending input token is dropped and
			parser enters error state

yylval is static within the .y module;  it is of type YYSTYPE

methods of parse class
	typedef int (*lexfptr)(void *lexrock, YYSTYPE *yylval);
	typedef (*tokefptr)(void *rock, char *token, int toknum);
	int Parse(void *parserock, lexfptr lexer, void *lexerrock);
	boolean parse::SetDebug(boolean newvalue);
	void EnumerateTokens(tokefptr proc, void *rock);
	int TokenNumberFromName(char *token);
	void Error(char *msg, void *parserock, void* lexrock);
		(must exist and must be user defined)

need to decide how to provide 'killval' function to discard stack values



Example


in rofparse.h:

	struct numrock {
		int r;
		FILE *e;
	};

in rofparse.C:

#include <rofparse.h>	/* declares numrock structure */.
#include <num.H>	/* declarations from parser generator */
#include <numtoks.H>	/* declarations from tlex */

...

	struct numtoks *numtoks;
	int (*numlex)(struct numtoks *);

	numtoks = numtoks_CreateState(text, start, length);

	struct numrock numrock;
	numrock.e = stderr;
	if (num_Parse(numtoks_Lexer, numtoks, &numrock) == parse_OK)
		*result = numrock.r;
	else {
		/* process parse failure */
	}
	numtoks_DestroyState(numtoks);


num.gra includes
	YYSTYPE
	productions
	error processor
	other related functions
num.C includes
	reduce actions
	parser
	create struct temporarily during compilation
	error function
	other related functions


/* timing program */
#include <sys/time.h>
int foo;
elapsed()  {
	static struct timeval prev;	
	long t;  struct timeval now;
	gettimeofday(&now, 0);
	t = (now.tv_sec - prev.tv_sec) * 1000000 
		+ (now.tv_usec - prev.tv_usec);
	prev = now;
	printf("%d\n", t);
}
f1() {}   f2(i) {foo = i;}
main()  {
	register int i;
	elapsed();  elapsed();  elapsed();
	for (i=10000000; i; i--) {}  elapsed();
	for (i=10000000; i; i--) {f1();}  elapsed();
	for (i=10000000; i; i--) {f1();}  elapsed();
	for (i=10000000; i; i--) {f1();}  elapsed();
	for (i=10000000; i; i--) {f2(i);}  elapsed();
	for (i=10000000; i; i--) {f2(i);}  elapsed();
	for (i=10000000; i; i--) {f2(i);}  elapsed();
}


DECISION - use parse object

run .y file through bison and then through a script
result is a .c file which implements the specifics for making a parse object

entry points to the .c file

foo_ParseTable()  returns the parse table


usage in object file


#include <parse.H>
#include <foo.H>	/* declares foo_ParseTable */
#include <tlex.H>
#include <foolex.H>	/* declarations from tlex */

...
	struct numtoks *numtoks;
	int (*numlex)(struct numtoks *);
	class parse *fooparse = parse_Create(foo_ParseTable());

	numtoks = numtoks_CreateState(text, start, length);

	struct numrock numrock;
	numrock.e = stderr;
	if ((fooparse)->Parse(numtoks_Lexer, numtoks, &numrock) == parse_OK)
		*result = numrock.r;
	else {
		/* process parse failure */
	}
	numtoks_DestroyState(numtoks);

various values can be set in various ways
	parameters to parse::Create  or
	set by methods on parse object  or
	parameters to parse::Parse
values:
	lexer
	lexer rock
	parse rock
	error handler
	

Appendix:  Timing program


/* bigger timing program */
#include <sys/time.h>

struct timeval prev;	

int foo;

#define M 200000

#define THRICE(a) a a a

#define BODY {\
	*b = a->s2;  \
	z = *c;  \
	INNARDS   \
}
#define INNARDS \
	c->rr1 = a->sr1;  \
	x.sr2 = &w;  \
	c->r1 = a->s1;  \
	px = a;  \
	pw = a->sr3;  \
	pw = c;  \
	pz = pw->rr2;



elapsed() 
{
	long t;
	struct timeval now;
	gettimeofday(&now, 0);

	t = (now.tv_sec - prev.tv_sec) * 1000000 
		+ (now.tv_usec - prev.tv_usec);

	prev = now;
	printf("\t%f\n", (t*1.0) / M);
}


struct s {
	int s1, s2, s3;
	struct r *sr1, *sr2, *sr3;
};
struct r {
	int r1, r2, r3;
	struct r *rr1, *rr2, *rr3;
};

struct s x, y, *px, *py;
struct r w, z, *pw, *pz;

f3(a, b, c)
	struct s *a;
	int *b;
	struct r *c;
{
	BODY;
}
f4(a, b, c)
	struct s *a;
	int *b;
	struct r *c;
{
	*b = a->s2;
	memcpy(c, &z, sizeof(*c));
	INNARDS;
}


main()
{
	register int i;
	elapsed();
printf("loop with no contents\n");
	THRICE(for (i=M; i; i--) {}   elapsed();)
printf("function call without memcpy\n");
	THRICE(for (i=M; i; i--) {f3(&x, &foo, &w);}  elapsed();)
printf("function call with memcpy\n");
	THRICE(for (i=M; i; i--) {f4(&x, &foo, &w);}  elapsed();)
printf("inline\n");
	THRICE(for (i=M; i; i--) { \
		struct s *a = &x; \
		int *b = &foo; \
		struct r *c = &w; \
		BODY; } 		elapsed();  }
}




From: Dr Adrian Johnstone, CS Dept, Royal Holloway, University of London
Email: adrian@dcs.rhbnc.ac.uk
Date: Sun, 27 Mar 1994 04:09:23 GMT

The first maintenance release of RDP (version 1.1) is now available via
ftp from cscx.cs.rhbnc.ac.uk (134.219.200.45) in directory /pub/rdp. Get
rdp1_1.zip if you are an MS-DOS user or rdp1_1.tar.Z if you are a Unix
user.

RDP - a recursive descent parser generator

RDP compiles attributed LL(1) grammars decorated with C-language semantic
actions into recursive descent compilers. RDP is written in strict ANSI C
and produces strict ANSI C. RDP was developed using Borland C 3.1 on a PC
and has also been built with no problems on Alpha/OSF-1, DECstation/Ultrix
and Sun 4/SunOS hosts all using GCC. The definition of strict ANSI here
means compiling with gcc -Wall -Wstrict-prototypes -fno-common -pedantic
-ANSI and getting no warning messages.

RDP itself and the language processors it generates use standard symbol,
set and scanner modules. The RDP scanner is programmed by loading tokens
into the symbol table at the start of each run.  RDP produces complete 
runnable programs with built-in help information and
command line switches that are specified as part of the EBNF file.


see also byacc - Berkeley Yacc - see Archie

Early user with useful comments:
	John McLaughlin		 	|  Phone +353 1 702-1539
	DSG, Dept. of Computer Science,	|  Fax   +353 1 677-2204
	Trinity College,		|  email John.McLaughlin@dsg.cs.tcd.ie
	Dublin 2, Ireland		